Abstract

Realistic 3D-conformations of protein structures can be embedded in a cubic lattice using
exclusively integer numbers, additions, subtractions and boolean operations.



1. Introduction 

In previous papers [1-5] we have built a series of mathematical tools for studying the
multidimensional molecular conformational space of biological macromolecules, with the aim of
understanding the dynamical states of proteins by building a complete energy surface [6,7].

An $N$-atom molecule has a $3(N-1)$-dimensional conformational space (CS). The sheer complexity of
this huge structure can be reduced to tractable dimensions by partitioning it with central hyperplanes¹
into a finite set of cells; this amounts to discarding all knowledge about molecular conformations other
than the cells that contain them.

In our approach [1], a set $\mathcal{H}$ of $N_{\mathcal{H}} = N \times (N-1)/2$ hyperplanes generates a partition in CS of
$(N!)^3$ cells. On the other hand, hyperplanes are oriented structures dividing the space into a + and a −
half-space; thus points within a cell are characterized by a binary sequence of length $N_{\mathcal{H}}$ enumerating
the orientations with respect to the hyperplane set. This binary sequence is all the information that
remains from the molecular conformations.

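Our choice of hyperplanes $\{H_{ij} \in \mathcal{H} : c_i - c_j = 0,\ 0 \le i < j \le N-1,\ c \in \{x, y, z\}\}$² [1] is such
that the +/− half-spaces are the points with $c_i > c_j$ and $c_i < c_j$ respectively. This induces an order
relation in the $x$, $y$ and $z$ coordinates of points in a cell

$c_{\alpha_0} < c_{\alpha_1} < c_{\alpha_2} < \dots < c_{\alpha_{N-2}} < c_{\alpha_{N-1}}$ (1)

where $\{\alpha_0, \alpha_1, \alpha_2, \dots, \alpha_{N-2}, \alpha_{N-1}\}$, a permutation of the sequence $\{0, 1, 2, \dots, N-2, N-1\}$, is the
dominance partition sequence (DPS) [1].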
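The two encodings described above can be illustrated with a minimal Python sketch. The function names `dominance_partition_sequence` and `sign_code` are hypothetical, and the sample values are the x-coordinates of the first four rows of Table I, re-indexed from 0 for the example. The sketch assumes that no two atoms share the same coordinate value.

```python
from typing import List, Sequence


def dominance_partition_sequence(coords: Sequence[int]) -> List[int]:
    """Atom indices sorted by increasing coordinate value (one DPS per axis)."""
    return sorted(range(len(coords)), key=lambda i: coords[i])


def sign_code(coords: Sequence[int]) -> List[int]:
    """Binary code of a cell: for every pair i < j, 1 if c_i > c_j, else 0."""
    n = len(coords)
    return [1 if coords[i] > coords[j] else 0
            for i in range(n) for j in range(i + 1, n)]


# x-coordinates of four consecutive atoms (illustrative values).
x = [19, 30, 31, 12]
dps_x = dominance_partition_sequence(x)   # [3, 0, 1, 2]
code_x = sign_code(x)                     # N(N-1)/2 = 6 bits for N = 4
print(dps_x, code_x)
```

The permutation and the binary code carry the same information: either one can be recomputed from the other, which is why the permutation can stand in for the much longer sign sequence.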

¹ Hyperplanes that pass through the origin.

² A convention used here is that $c$ represents any of the Cartesian coordinates $x$, $y$, $z$.
The compactness and hierarchical structure of the codes generated by partition sequences made
possible the construction of a graph whose nodes are the cells in CS that are visited by the thermalized
molecule, with edges towards adjacent cells; this was the subject developed in previous works [2-5].

However interesting this result may be, it is of no practical use unless on top of it there is a
method for calculating the energy of molecular conformations in a cell. With the mesoscopic force
field approximations currently used in molecular simulations [8,9], where atoms are represented as
point-like structures, the only inputs to the Hamiltonian energy function are the interatomic distances
calculated from 3D molecular conformations. In this framework the purpose of this work is twofold:

1. given a partition sequence, we want to calculate a fair sample of compatible 3D molecular 
conformations, 

2. we want to encode the set of sampled conformations with a combinatorial structure so they can 
be more easily manipulated. 

The following sections describe the algorithms for doing this:

• In section 2 we build a complete set of lattice covalent bond segments, which are the basic 
building blocks: the whole molecular structure is built upon them. 

• The DPSs can be seen as the lattice projections of a molecular structure where all intervals in
each dimension are reduced to one lattice spacing (Fig. 5 of [1]); these have to be increased
locally to obtain a realistic structure. In section 3 we build the partially ordered set of lattice
intervals between bonded atoms, a structure needed for calculating the maximum and minimum
expansion values of each interval. This gives a set of linear inequalities described in section 4.

• In section 5 it is shown how an inter-dependent system of inequalities can be made independent. 

• In section 6 the form and structure of the system of linear inequalities is discussed in detail. 
Figure 1: Stereoview of a pancreatic trypsin inhibitor protein (PTI) Cα-backbone molecular
conformation (Table I), corresponding to the dominance partition sequences in Fig. 2.

To illustrate the algorithmic methods that are the subject of the present work, we have chosen
as an example (Fig. 1 and Table I) the Cα-backbone of the pancreatic trypsin inhibitor protein [10],
because it is a small protein molecule and the mathematical structures it generates are of moderate
size, yet it has the complexity that can be found in longer molecules. The side chains have been
put aside for the same reason: they would have made the contents of Figs. 2 and 3 almost unreadable.
Figure 2: Dominance partition sequence of the PTI Cα-backbone for the molecular conformation from
Fig. 1, showing the maximal intervals for each coordinate.



2. The expanded lattice covalent bond segments set 

The numbers in the x, y and z dominance partition sequences can be regarded as the evenly spaced
projections of N points in a 3D cubic lattice. It is a particular form of embedding where the separation
between consecutive projections of atoms in x, y and z has been shrunk to one lattice spacing. The
aim of the present work is to expand this embedding so as to obtain realistic molecular structures.

To do this we must restrict the most basic element of molecular structures, the covalent bond,
to a finite set of coordinate values which, with a suitable unit of length, can be transformed to
give integer values exclusively. These restricted bonds can still be useful for describing real molecular
conformations if the minimum magnitude of vector differences is small enough. This can be done,
for the example developed here (PTI Cα-backbone), using empirical data sampled from molecular
dynamics simulations [11]; it requires the following steps

1. First we determine the dimensions of the lattice by taking as reference the mean bond length
and its range of variation for bonded Cα pairs; in our case this gives 3.58 Å < 3.86 Å < 4.13 Å.
We arbitrarily set the mean bond length to 20 lattice units, which gives a lattice spacing of
0.19 Å. Thus, any segment between two lattice points with a length between 3.58 x 20/3.86
and 4.13 x 20/3.86 is potentially a Cα-Cα bond segment, and the set B of valid lattice bond
segments, modulo a lattice translation along the x, y and z axes, is the set of segments starting
at the origin and ending in any lattice point that lies between two spheres of radius 3.58 x 20/3.86
and 4.13 x 20/3.86 respectively. This gives a total of 1883 primary segments, excluding reflections
through the xy, xz and yz planes.

2. Next we determine the range of variation for the bond angles, which is greater than that for the
bond length and varies considerably along the chain. For each bond angle $A_{\alpha,\alpha+1,\alpha+2}$ we
determine two integer numbers: the floored minimum $\lfloor \min(A_{\alpha,\alpha+1,\alpha+2}) \rfloor$ and the ceiled
maximum $\lceil \max(A_{\alpha,\alpha+1,\alpha+2}) \rceil$ of its range. These divide the interval between the absolute
minimum and maximum values, 71° and 167°, into 64 subintervals

71°-74°-75°-76°-77°-78°-79°-80°-81°-82°-87°-89°-
90°-92°-93°-94°-95°-96°-97°-98°-99°-100°-101°-
104°-105°-106°-107°-108°-109°-110°-112°-113°-
114°-115°-116°-117°-118°-119°-120°-121°-124°-
125°-127°-129°-135°-136°-138°-139°-143°-144°-
147°-148°-149°-150°-151°-152°-153°-154°-155°-
156°-157°-159°-162°-163°-167° (2)

3. The dynamic values of each $A_{\alpha,\alpha+1,\alpha+2}$ span a given range of intervals from (2); thus
consecutive bonds $B_\alpha$ and $B_{\alpha+1}$ can only be assigned discrete bond segments that form an angle within
the specific range.
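The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: "excluding reflections" is read here as keeping the non-negative octant, the function names are hypothetical, and the angle test uses floating point for readability, whereas the method described in this work relies on integer operations only.

```python
import math
from itertools import product

MEAN_UNITS = 20                       # mean bond length in lattice units
R_MIN = 3.58 * MEAN_UNITS / 3.86      # inner sphere radius (lattice units)
R_MAX = 4.13 * MEAN_UNITS / 3.86      # outer sphere radius (lattice units)


def primary_bond_segments():
    """Lattice points of the non-negative octant lying between the two spheres."""
    lo, hi = R_MIN ** 2, R_MAX ** 2
    top = int(R_MAX)
    return [p for p in product(range(top + 1), repeat=3)
            if lo <= sum(c * c for c in p) <= hi]


def bond_angle_deg(b1, b2):
    """Angle at the shared atom between bond b1 (arriving) and bond b2 (leaving)."""
    dot = -sum(p * q for p, q in zip(b1, b2))
    norm = math.sqrt(sum(p * p for p in b1) * sum(q * q for q in b2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def angle_allowed(b1, b2, lo_deg, hi_deg):
    """True if two consecutive segments form an angle within [lo_deg, hi_deg]."""
    return lo_deg <= bond_angle_deg(b1, b2) <= hi_deg


segments = primary_bond_segments()
print(len(segments), "segments in the non-negative octant")
print(bond_angle_deg((20, 0, 0), (0, 20, 0)))   # a right angle
```

The size of `segments` can be compared with the count of primary segments quoted in step 1; the two agree only if the octant convention assumed here matches the one used for that count.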

In building realistic 3D-conformations from the DPSs by embedding these in a bigger lattice, the
following problem arises: the intervals Cα - Cα+1 between consecutive Cαs, for a given coordinate in
Fig. 2, must be replaced by lattice intervals which are generally longer, so the excess lattice units must
be distributed among the intermediate sequence intervals, such that the resulting lattice segments
bonding Cαs are from the set of valid lattice bond segments described above.

To solve this problem the following steps are needed 

1. build from the DPSs the consecutive Cα intervals poset (Fig. 3),

2. determine for each consecutive Cα interval the maximum and minimum excess values,

3. make the linear inequalities in x, y and z independent of one another. 

3. The consecutive intervals poset 

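Fig. 2 shows the DPSs for the PTI Cα-backbone. It also shows some of the intervals between
consecutive Cαs, the $I^c_\alpha$s³; a partial order relation can be defined for them. But first, we recall some basic
definitions: let $I^c_{\alpha_1}$ and $I^c_{\alpha_2}$ be two $I^c_\alpha$s spanning the DPS$_c$ intervals $\{\sigma^{min}_{\alpha_1}, \sigma^{max}_{\alpha_1}\}$ and $\{\sigma^{min}_{\alpha_2}, \sigma^{max}_{\alpha_2}\}$.

Definition 1 $I^c_{\alpha_1}$ precedes $I^c_{\alpha_2}$, or $I^c_{\alpha_1} \prec I^c_{\alpha_2}$,

if $I^c_{\alpha_1} \subset I^c_{\alpha_2}$ or equivalently $\sigma^{min}_{\alpha_1} \ge \sigma^{min}_{\alpha_2}$ and $\sigma^{max}_{\alpha_1} \le \sigma^{max}_{\alpha_2}$.

Definition 2 $I^c_{\alpha_2}$ succeeds $I^c_{\alpha_1}$, or $I^c_{\alpha_2} \succ I^c_{\alpha_1}$.

Definition 3 A maximal interval is not succeeded by any other interval.

Definition 4 A minimal interval is not preceded by any other interval.

Fig. 2 shows the set of maximal intervals for DPS$_x$, DPS$_y$ and DPS$_z$.

Definition 5 A cover is a set of two intervals $I^c_{\alpha_1} \prec I^c_{\alpha_2}$ with no $I^c_{\alpha_3}$ such that $I^c_{\alpha_1} \prec I^c_{\alpha_3} \prec I^c_{\alpha_2}$.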

³ The following naming convention applies to any symbol referring to a bond interval Cα - Cα+1: it bears only the
smaller index.
Figure 3: Consecutive Cα intervals cover graph. Minimal/maximal intervals are at the bottom/top
respectively, with succession going from bottom to top. For each interval

Fig. 3 displays a graphical representation of this partially ordered set (poset), where the nodes are
the $I^c_\alpha$ set and the edge set consists of the pairs satisfying the cover relation. As we shall see below, the
poset structure allows us to define the set of linear inequalities for determining the lattice bond segments.
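Definitions 1 to 5 translate directly into a few set comprehensions. The sketch below is illustrative only: the dictionary `SPANS` holds hypothetical (first position, last position) pairs for five x-intervals, chosen to match the example discussed in section 4, and the function names are not taken from the original work.

```python
# interval label -> (sigma_min, sigma_max) in the x-sequence (illustrative values)
SPANS = {52: (8, 20), 30: (9, 12), 49: (10, 13), 54: (14, 16), 53: (16, 20)}


def precedes(a, b):
    """Definition 1: interval a is contained in interval b."""
    (a_lo, a_hi), (b_lo, b_hi) = SPANS[a], SPANS[b]
    return a != b and a_lo >= b_lo and a_hi <= b_hi


def cover_pairs():
    """Definition 5: pairs a < b with no interval strictly in between."""
    keys = list(SPANS)
    return sorted((a, b) for a in keys for b in keys
                  if precedes(a, b)
                  and not any(precedes(a, c) and precedes(c, b) for c in keys))


# Definitions 3 and 4: maximal and minimal elements of the poset.
maximal = sorted(a for a in SPANS if not any(precedes(a, b) for b in SPANS))
minimal = sorted(a for a in SPANS if not any(precedes(b, a) for b in SPANS))

print(maximal)         # [52]
print(minimal)         # [30, 49, 53, 54]
print(cover_pairs())   # edges of the cover graph
```

The quadratic and cubic loops are acceptable for a backbone of a few dozen intervals; a longer chain would call for a sweep over sorted end points instead.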
4. Determining the bounds on excess values



The excess value of an interval is the difference between its length on the DPS and on the
extended lattice. In order to expand the DPS lattice we must first determine the bounds of the excess
values for every $I^c_\alpha$.



52 30 49 46 31 50 55 25 54 20 18 23 53
    9 10 11 12 13 14 15 16 17 18 19 20

Figure 4: Sequence of a maximal interval showing its minimal preceding intervals.

An example will help to understand. We have in Fig. 4 a set of 5 connected $I^x_\alpha$s: $I^x_{52}$, which is a
maximal interval, and its minimal predecessors $I^x_{30}$, $I^x_{49}$, $I^x_{54}$ and $I^x_{53}$ (Fig. 3). They fill positions 8 to
20 in the x-sequence, where the 12 minimal intervals between Cαs have local excess variables $x_9$ to $x_{20}$
(Fig. 4), giving the local expansion value in the extended lattice. The following equations define the
excess values



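$X^x_{52} = \sum_{9 \le \sigma \le 20} x_\sigma - |I^x_{52}|$ (where the last term is the c-sequence interval length)

$X^x_{30} = \sum_{10 \le \sigma \le 12} x_\sigma - |I^x_{30}|$, $X^x_{49} = \sum_{11 \le \sigma \le 13} x_\sigma - |I^x_{49}|$ (3)

$X^x_{54} = \sum_{15 \le \sigma \le 16} x_\sigma - |I^x_{54}|$, $X^x_{53} = \sum_{17 \le \sigma \le 20} x_\sigma - |I^x_{53}|$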

Also, $X^x_{52}$ must be greater than or equal to the sum of the $X^x_\alpha$ from preceding non-overlapping intervals

$X^x_{52} \ge X^x_{49} + X^x_{54} + X^x_{53}$, $X^x_{52} \ge X^x_{30} + X^x_{54} + X^x_{53}$ (4)
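A small numerical check may help to fix ideas. The sketch below assumes that the excess of an interval is the sum of the expanded lengths of the minimal intervals it spans minus its length on the DPS, which is how the definitions above are read here. The expanded lengths in `x` are hypothetical values chosen for the illustration, not data from the PTI example, and `SPANS` lists the positions occupied by each interval in the x-sequence of Fig. 4.

```python
# interval label -> (first position, last position) in the x-sequence
SPANS = {52: (8, 20), 30: (9, 12), 49: (10, 13), 54: (14, 16), 53: (16, 20)}

# Hypothetical expanded lengths of the 12 minimal intervals, sigma = 9..20.
x = {9: 2, 10: 1, 11: 1, 12: 5, 13: 1, 14: 1,
     15: 4, 16: 1, 17: 1, 18: 1, 19: 1, 20: 1}


def excess(alpha):
    """Expanded length of interval alpha minus its length on the DPS."""
    lo, hi = SPANS[alpha]
    expanded = sum(x[s] for s in range(lo + 1, hi + 1))
    return expanded - (hi - lo)


X = {alpha: excess(alpha) for alpha in SPANS}
print(X)

# The excess of the maximal interval dominates the sum over any family of
# non-overlapping predecessors.
print(X[52] >= X[49] + X[54] + X[53], X[52] >= X[30] + X[54] + X[53])
```

Intervals 30 and 49 overlap on the x-sequence, so they never appear in the same sum; each of them is combined separately with 54 and 53.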

To build from (4) a complete system of linear inequalities allowing us to calculate the $x_\sigma$s for
embedding the molecular system in the extended lattice, first we need to determine the bounds

$\mathrm{Xmin}^c_\alpha \le X^c_\alpha \le \mathrm{Xmax}^c_\alpha$ (5)



By construction the maximum lattice bond segment length on any coordinate is 21, this gives for 
the extreme values of excess lattice units on any interval the relation 

0<KI+^a<21 (6) 
which settles the initial minimum and maximum bond lattice units for the c-coordinate to 

^mm ^ and 6™"^ = I^Sl + 21 (7) 

respectively. Let $B^{b^{min},b^{max}}$ be the set of all lattice bond segments $b$ such that $b_c^{min} \le b_c \le b_c^{max}$ for
$c \in \{x, y, z\}$; then the set $B_{I_\alpha}$ of all the lattice bond segments that are within the bounds (7) is

$B_{I_\alpha} = B \cap B^{b^{min},b^{max}}$ (8)



This operation may change the bounds (7). This is because the $b \in B_{I_\alpha}$ have a common origin but
the points at the other extreme form a connected irregular cluster (see the example in Fig. 5): the
bonds excluded by (8) may be the ones that contain the extremes of other coordinates. This gives a
new set of bonds and the process has to be repeated until the bounds stabilize.
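The effect described above can be reproduced with a short Python sketch. It assumes the bond set of section 2 restricted to the non-negative octant, written directly in terms of squared lengths (345 and 457 are the smallest and largest integer squared lengths between the two spheres). The helper names are hypothetical.

```python
from itertools import product


def primary_segments(r2_min=345, r2_max=457):
    """Non-negative lattice points whose squared norm lies in [r2_min, r2_max]."""
    top = int(r2_max ** 0.5)
    return [p for p in product(range(top + 1), repeat=3)
            if r2_min <= sum(c * c for c in p) <= r2_max]


def restrict(segments, bounds):
    """Keep the segments whose components lie within the per-coordinate bounds."""
    return [b for b in segments
            if all(lo <= c <= hi for c, (lo, hi) in zip(b, bounds))]


def bounding_box(segments):
    """Smallest and largest component found on each coordinate."""
    return [(min(b[k] for b in segments), max(b[k] for b in segments))
            for k in range(3)]


B = primary_segments()
box_before = bounding_box(B)

# Narrowing the x bounds alone also narrows the y and z bounds.
B_restricted = restrict(B, [(20, 21), (0, 21), (0, 21)])
box_after = bounding_box(B_restricted)

print(box_before)   # every coordinate spans 0..21
print(box_after)    # y and z are now limited to 0..7
```

Because the end points fill a spherical shell rather than a box, a long x-component leaves little room for y and z, which is why the bounds have to be recomputed after each restriction.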
Figure 5: 2D example of a $B_{I_\alpha}$ set. The lattice bond segments ($b$) start at the origin, and end on any
lattice point in the region bounded by the two spheres described in section 2.1.
The only $b$ end points shown are those lying within the x and y bounds, or equivalently $b \in B_{I_\alpha}$.
As shown in the picture, the set $B_{I_\alpha}$ can be decomposed into the minimal covering set of $n_p = 4$
rectangular subsets $P^0_{I_\alpha}$, $P^1_{I_\alpha}$, $P^2_{I_\alpha}$ and $P^3_{I_\alpha}$.



5. Making the inequalities independent 

From the set of bounds (7) we can build the set of linear inequalities (using again the example from 
the previous section) 



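$\mathrm{Xmax}^x_{52} \ge \sum_{9 \le \sigma \le 20} x_\sigma \ge \mathrm{Xmin}^x_{52}$

$\mathrm{Xmax}^x_{30} \ge \sum_{10 \le \sigma \le 12} x_\sigma \ge \mathrm{Xmin}^x_{30}$

$\mathrm{Xmax}^x_{49} \ge \sum_{11 \le \sigma \le 13} x_\sigma \ge \mathrm{Xmin}^x_{49}$ (9)

$\mathrm{Xmax}^x_{54} \ge \sum_{15 \le \sigma \le 16} x_\sigma \ge \mathrm{Xmin}^x_{54}$

$\mathrm{Xmax}^x_{53} \ge \sum_{17 \le \sigma \le 20} x_\sigma \ge \mathrm{Xmin}^x_{53}$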



There is a further problem to be taken into consideration: $\mathrm{Xmin}^c_\alpha$ and $\mathrm{Xmax}^c_\alpha$ are the c-coordinate
bounds of the set $B_{I_\alpha}$ but, due to the non-uniform shape of $B_{I_\alpha}$, selecting one or more c-values in this
interval while discarding the rest may completely change the bounds in the other coordinates. This
induces an interdependence between the inequalities (9) in x, y and z, in which case solving the system
becomes much more complex.

This problem can be avoided if the end points of bonds in $B_{I_\alpha}$ completely fill a rectangular lattice
parallelepiped; in this case the choice of bounds in one coordinate leaves the others unchanged. Thus
$B_{I_\alpha}$ has to be decomposed into a set of rectangular parallelepipeds $\mathcal{P}_{I_\alpha}$

$B_{I_\alpha} = \bigcup_{0 \le p < n_p} P^p_{I_\alpha}$ (10)

subject to the following conditions

1. there are no $P^{p_1}_{I_\alpha} \in \mathcal{P}_{I_\alpha}$ and $P^{p_2}_{I_\alpha} \in \mathcal{P}_{I_\alpha}$ such that $P^{p_1}_{I_\alpha} \subset P^{p_2}_{I_\alpha}$,

2. $n_p$ is minimal,



3. for Pi„ obeying conditions 1 and 2 and T^' G Pi„ there is no such that | < \Pp |. 
In that case shrinking the bounds of a set $P^p_{I_\alpha}$ for any coordinate does not alter the bounds in the
other dimensions, and thus solutions to the inequalities can be found independently for each coordinate.
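As an illustration of the decomposition, the following Python sketch enumerates, by brute force, the completely filled rectangles of a small 2D point set and keeps those not contained in a larger one, which is the first of the conditions listed above. It does not attempt to select a cover with the smallest number of rectangles; that selection is a set-cover problem and is left out of this sketch. All names are hypothetical, and a rectangle is written as (x0, x1, y0, y1).

```python
def filled_boxes(points):
    """All axis-aligned lattice rectangles whose lattice points all belong to the set."""
    pts = set(points)
    xs = range(min(p[0] for p in pts), max(p[0] for p in pts) + 1)
    ys = range(min(p[1] for p in pts), max(p[1] for p in pts) + 1)
    boxes = []
    for x0 in xs:
        for x1 in range(x0, xs.stop):
            for y0 in ys:
                for y1 in range(y0, ys.stop):
                    if all((x, y) in pts
                           for x in range(x0, x1 + 1)
                           for y in range(y0, y1 + 1)):
                        boxes.append((x0, x1, y0, y1))
    return boxes


def inside(a, b):
    """True if rectangle a is strictly contained in rectangle b."""
    return (a != b and b[0] <= a[0] and a[1] <= b[1]
            and b[2] <= a[2] and a[3] <= b[3])


def maximal_boxes(points):
    """Filled rectangles that are not contained in any other filled rectangle."""
    boxes = filled_boxes(points)
    return sorted(a for a in boxes if not any(inside(a, b) for b in boxes))


# An L-shaped cluster of end points needs two rectangles.
cluster = [(0, 0), (1, 0), (2, 0), (0, 1)]
print(maximal_boxes(cluster))   # [(0, 0, 0, 1), (0, 2, 0, 0)]
```

Inside any one of the returned rectangles, narrowing the x range leaves the y range untouched, which is the independence property required of the decomposition.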

6. The structure of the solutions 

The inequalities (9), for instance, can be rewritten as

$\mathrm{Xmax}^x_{52} \ge U^x_{52} \cdot \mathbf{x}^x \ge \mathrm{Xmin}^x_{52}$ ... (11)

where the $U^c_\alpha$s are $(N-1)$-dimensional vectors of the form

$U^c_\alpha = (0, \dots, 0, 1, \dots, 1, 0, \dots, 0)$ (12)

with ones in the contiguous positions from $\sigma^{min}_\alpha$ to $\sigma^{max}_\alpha$ and zeros everywhere else, and $\mathbf{x}^x$ is the
vector

$\mathbf{x}^x = (x_1, \dots, x_9, \dots, x_{20}, \dots, x_{N-1})$ (13)

Extending this notation to the whole set of inequalities for $0 \le \alpha < N-1$ and $x$, $y$ and $z$, we have

$\mathrm{Xmax}^x_\alpha \ge U^x_\alpha \cdot \mathbf{x}^x \ge \mathrm{Xmin}^x_\alpha$

$\mathrm{Xmax}^y_\alpha \ge U^y_\alpha \cdot \mathbf{x}^y \ge \mathrm{Xmin}^y_\alpha$ (14)

$\mathrm{Xmax}^z_\alpha \ge U^z_\alpha \cdot \mathbf{x}^z \ge \mathrm{Xmin}^z_\alpha$

Taking the vectors $U^c_\alpha$ as the rows of an $(N-1) \times (N-1)$ matrix $U^c$, and $\mathrm{Xmax}^c_\alpha$/$\mathrm{Xmin}^c_\alpha$ as
the components of vectors $\mathrm{Xmax}^c$/$\mathrm{Xmin}^c$, (14) can be rewritten as

$\mathrm{Xmax}^x \ge U^x \cdot \mathbf{x}^x \ge \mathrm{Xmin}^x$, $\mathrm{Xmax}^y \ge U^y \cdot \mathbf{x}^y \ge \mathrm{Xmin}^y$, $\mathrm{Xmax}^z \ge U^z \cdot \mathbf{x}^z \ge \mathrm{Xmin}^z$ (15)

The above set of inequalities defines $2 \times (N-1)$ affine half-spaces $\mathrm{Hmin}^c_\alpha$ and $\mathrm{Hmax}^c_\alpha$ whose intersection
determines an H-polytope in CS$^c$ [12, 13]. Hence, the vertices of this polytope are among the unique
solutions of the $3 \times 2^{N-1}$ systems of equations

$U^x \cdot \mathbf{x}^x = \mathrm{Xlim}^x$, $U^y \cdot \mathbf{x}^y = \mathrm{Xlim}^y$, $U^z \cdot \mathbf{x}^z = \mathrm{Xlim}^z$, $0 \le \alpha < N-1$ (16)

where $\mathrm{Xlim}^c_\alpha$ can be either $\mathrm{Xmax}^c_\alpha$ or $\mathrm{Xmin}^c_\alpha$ and the $\ge$ relation in (15) has been restricted to =.
Moreover, the matrices $U^c$ with rows like (12) are called interval matrices; they belong to a very
important class of matrices called totally unimodular matrices [12]. These have the particularity
that the determinant of any square submatrix is −1, 0 or 1. This ensures that the vertices of the polytope
are integer vectors (or lattice points), since when solving (16) by Cramer's rule the denominator
is always −1 or 1. Thus, the solutions of (16) can be written

$\mathbf{x}^c = V^c \cdot \mathrm{Xlim}^c$ (17)

where $V^c$ is the inverse of $U^c$.

The V-polytope is the representation of the polytope by its set of vertices; these can be obtained
from (17) by determining the combinations in $\mathrm{Xlim}^c$ compatible with (15). The solutions of the system
of linear inequalities (15) can be generated from this set through convex combinations. As the three
sets of inequalities are independent, the general solution is the product of the x, y and z polytopes.

The total unimodularity of the matrix $U^c$ also ensures that most combinatorial algorithms can be run
in polynomial time.
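The integrality argument can be checked on a toy case with integer arithmetic only. The sketch below builds a 3 x 3 interval matrix with rows of the form (12), verifies by brute force that every square submatrix has determinant −1, 0 or 1, and solves one system of the form (16) by Cramer's rule. The matrix, the right-hand side and the function names are hypothetical; the exhaustive test is exponential and only meant for very small matrices.

```python
from itertools import combinations


def interval_row(n, lo, hi):
    """Row with ones in the contiguous positions lo..hi and zeros elsewhere."""
    return [1 if lo <= k <= hi else 0 for k in range(n)]


def det(m):
    """Integer determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        if a:
            minor = [row[:j] + row[j + 1:] for row in m[1:]]
            total += (-1) ** j * a * det(minor)
    return total


def is_totally_unimodular(m):
    """Brute-force check that every square submatrix has determinant -1, 0 or 1."""
    rows, cols = len(m), len(m[0])
    for k in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), k):
            for cs in combinations(range(cols), k):
                if det([[m[r][c] for c in cs] for r in rs]) not in (-1, 0, 1):
                    return False
    return True


def solve_cramer(u, rhs):
    """Solve u . x = rhs over the integers; requires det(u) = 1 or -1."""
    d = det(u)
    if d not in (-1, 1):
        raise ValueError("matrix is singular or not unimodular")
    solution = []
    for i in range(len(u)):
        ui = [row[:i] + [rhs[r]] + row[i + 1:] for r, row in enumerate(u)]
        solution.append(det(ui) // d)   # exact, since d is 1 or -1
    return solution


U = [interval_row(3, 0, 2), interval_row(3, 1, 2), interval_row(3, 2, 2)]
xlim = [7, 5, 2]                 # one choice of Xmin/Xmax per row
print(is_totally_unimodular(U))  # True
print(solve_cramer(U, xlim))     # [2, 3, 2]
```

Every component of the solution is an integer because the only division is by a determinant equal to 1 or −1, which is the property used in the text to place the polytope vertices on lattice points.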
7. Conclusion



The purpose of the line of work being developed here is to show that molecular structures can
be built and analysed with a fraction of the information (in our case less than 1/5) that can be found
in a typical PDB file.

This might seem a significant but modest quantitative difference, but qualitatively it is more than
that: discarding information results in the emergence of mathematical structures that were buried in
the complexity of the data, which in turn can be encoded efficiently by them. Using combinatorics,
a great number of molecular conformations can be dealt with simultaneously, thus overcoming the barrier
that computations have to be performed on the basis of one conformation at a time.
The algorithmic method developed before [1-5] serves two purposes

1. As an amplifier: by codifying data sampled in computer simulations into discrete geometrical
structures, these can be combined to generate an estimate of the volume occupied by a molecule
in its conformational space.

2. As a molecular 3D-structure compressor: it is possible to translate basic features of molecular
3D-structures into a binary code, which in turn can be very efficiently amalgamated into ternary
sequences that encode great numbers of cells from CS. The information on the whole CS volume
can be cast into a file compatible with desktop memory size.

The present work is the first one of a third and last step: the development of combinatorial methods 
for calculating the energy of structures from cells in CS. 

Here we have developed the basic algorithms for this: realistic discrete protein conformations can
be built and embedded in a cubic lattice using a table of discrete bond segments and, more importantly,
these conformations can be encoded into combinatorial structures.

However many issues still remain unexplored: 

• The possible combinations of $P^p_{I_\alpha}$s from (10) form a huge set; efficient sampling methods should be
developed.

• The V-polytope should be better characterized. 

• The present formalism should be extended to take into account sets of adjacent cells. 

• Last of all inter-atomic distances should also be encoded into combinatorial structures. 

These will be dealt with in forthcoming works.
8. Appendix

Table I. Lattice coordinates of the PTI Cα-backbone from Fig. 1.

Column α: Cα number.

Columns x, y, z: Cα coordinates.

Columns bx, by, bz: bond vector between Cα−1 and Cα.



  α     x     y     z    bx    by    bz
  1    19     7     1    19     7     1
  2    30     5   -16    11    -2   -17
  3    31    26   -19     1    21    -3
  4    12    26   -26   -19     0    -7
  5    19    19   -43     7    -7   -17
  6    28    35   -49     9    16    -6
  7    19    52   -53    -9    17    -4
  8    27    68   -45     8    16     8
  9    29    86   -53     2    18    -8
 10    28   106   -48    -1    20     5
 11    47   105   -41    19    -1     7
 12    60   120   -46    13    15    -5
 13    54   131   -31    -6    11    15
 14    40   145   -33   -14    14    -2
 15    24   141   -21   -16    -4    12
 16     6   137   -29   -18    -4    -8
 17    -3   126   -15    -9   -11    14
 18   -15   111   -21   -12   -15    -6
 19    -8    93   -14     7   -18     7
 20   -16    76   -19    -8   -17    -5
 21    -7    60   -28     9   -16    -9
 22   -13    42   -32    -6   -18    -4
 23   -15    41   -52    -2    -1   -20
 24   -10    22   -57     5   -19    -5
 25   -18    25   -75    -8     3   -18
 26   -36    29   -67   -18     4     8
 27   -35    17   -51     1   -12    16
 28   -40    30   -37    -5    13    14
 29   -34    49   -30     6    19     7
 30   -33    66   -40     1    17   -10
 31   -25    84   -38     8    18     2
 32    -9    94   -44    16    10    -6
 33    -1   113   -41     8    19     3
 34    14   111   -29    15    -2    12
 35    30   123   -29    16    12     0
 36    38   117   -12     8    -6    17
 37    55   106   -15    17   -11    -3
 38    64    91   -25     9   -15   -10
 39    50    80   -34   -14   -11    -9
 40    44    61   -30    -6   -19     4
 41    36    55   -13    -8    -6    17
 42    17    51   -19   -19    -4    -6
 43    12    69   -11    -5    18     8
 44    -3    68     2   -15    -1    13
 45   -14    83    12   -11    15    10
 46   -32    75    11   -18    -8    -1
 47   -44    65    -1   -12   -10   -12
 48   -50    50    10    -6   -15    11
 49   -33    46    18    17    -4     8
 50   -24    47     0     9     1   -18
 51   -38    32    -4   -14   -15    -4
 52   -34    23    13     4    -9    17
 53   -15    22     9    19    -1    -4
 54   -17    18   -11    -2    -4   -20
 55   -24    -1    -8    -7   -19     3
 56   -36   -12     3   -12   -11    11
 57   -52   -15    -9   -16    -3   -12